[Perf] Early return in KVCacheManager.allocate_slots #29206
Conversation
Signed-off-by: Jialin Ouyang <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
):
    # Early return as no new blocks needed to be allocated
    return self.empty_kv_cache_blocks
```
Skipped sliding-window cleanup when no new blocks allocated
allocate_slots now returns early when num_blocks_to_allocate==0 and no prefix-cache hits, so coordinator.remove_skipped_blocks() is never called on those steps. For sliding-window/ChunkedLocal attention this call is what frees blocks that have fallen outside the window; skipping it leaves those blocks held until the request hits a block boundary or finishes, inflating block_pool usage and causing unnecessary evictions or scheduling failures for long prompts that keep generating within an existing block.
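To make the hazard concrete, here is a toy, self-contained simulation (all names are invented; this is not vLLM code) of one request decoding a token per step under sliding-window attention. When cleanup only runs on steps that allocate a new block (the early-return behavior), out-of-window blocks stay held for up to `block_size - 1` extra steps, so the cumulative block-pool usage is higher:

```python
def simulate(num_steps, block_size=4, window=4, early_return=True):
    """Toy model: decode one token per step; return total block-steps held.

    A block can be freed once every token it holds has fallen outside the
    sliding window. With early_return=True, that cleanup only happens on
    steps that also allocate a new block.
    """
    held = set()          # indices of blocks currently held by the request
    allocated = 0         # number of blocks ever allocated
    held_block_steps = 0  # sum over steps of blocks held (pool-usage proxy)
    for n in range(1, num_steps + 1):      # n = tokens generated so far
        needed = -(-n // block_size)       # ceil(n / block_size)
        if needed <= allocated and early_return:
            held_block_steps += len(held)
            continue                       # early return: cleanup is skipped
        # cleanup (remove_skipped_blocks analogue): free blocks fully
        # outside the sliding window
        window_start = max(1, n - window + 1)
        lowest_needed = (window_start - 1) // block_size
        held = {b for b in held if b >= lowest_needed}
        # allocate any newly needed blocks
        held.update(range(allocated, needed))
        allocated = max(allocated, needed)
        held_block_steps += len(held)
    return held_block_steps

print(simulate(12, early_return=True), simulate(12, early_return=False))
```

Over 12 decode steps with a 4-token block and a 4-token window, the early-return variant accumulates more held block-steps than running cleanup every step, illustrating the inflated block_pool usage described above.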
```python
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
```

Suggested change:

```diff
-    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
+    and new_computed_blocks is None
```
I prefer this slightly.
And do we need to add num_computed_tokens > request.num_prompt_tokens so that remove_skipped_blocks is still called on the first decode step? That would help free the blocks for prefill tokens that were used by the last prefill step but fall outside the sliding window of the first decode step. I would be grateful if you could try gpt-oss and gemma3, two models with small sliding window sizes.
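A sketch of the combined guard the review is converging on (the names follow the diff and the comment above; the exact parameters of allocate_slots are not verified against vLLM's current signature, so treat this as illustrative only). The extra token comparison keeps the early return from firing on the first decode step, where num_computed_tokens still equals the prompt length, so cleanup runs at least once after prefill:

```python
def can_early_return(num_blocks_to_allocate, new_computed_blocks,
                     num_computed_tokens, num_prompt_tokens):
    """Hypothetical guard: early-return only when no new blocks are needed,
    there are no prefix-cache hits, and the request is past its first
    decode step (so post-prefill sliding-window cleanup has already run)."""
    return (
        num_blocks_to_allocate == 0
        and new_computed_blocks is None
        and num_computed_tokens > num_prompt_tokens
    )
```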
Also, can you add a comment noting that this may delay remove_skipped_blocks and cache_blocks, and give some analysis of why that is fine?
Hmmm, great call. I didn't notice the cache_blocks logic. Let me think about the early return logic more :/
Signed-off-by: Jialin Ouyang <[email protected]>
Purpose
As the title says, add an early return to KVCacheManager.allocate_slots to speed up scheduling, since most requests only allocate a new block once every 16 steps.
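A quick sanity check of the "one block every 16 steps" claim (16 tokens per block is vLLM's default block size): during decode, one token is appended per step, so a new KV-cache block is needed only when the token count crosses a block boundary:

```python
BLOCK_SIZE = 16  # vLLM's default tokens-per-block

def needs_new_block(num_tokens):
    """True iff appending the num_tokens-th token starts a new block."""
    return (num_tokens - 1) % BLOCK_SIZE == 0

# Over 160 decode steps, only 1 step in 16 allocates a block;
# the other 15 can take the early-return fast path.
allocating = sum(needs_new_block(n) for n in range(1, 161))
print(allocating)  # 10 of 160 steps
```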
Test Plan & Test Result
With the change, the allocate_slots cost is roughly halved, from 17.24% to 8.24% of total runtime (with async scheduling enabled).

Total runtime distribution for allocate_slots per engine step

Public benchmarks